Conversation


@ritter-x2a ritter-x2a commented Jun 27, 2025

If we can't fold a PTRADD's offset into its users, lowering it to a
disjoint OR is preferable: often, a 32-bit OR instruction suffices
where we'd otherwise need a pair of 32-bit additions with carry.
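
The equivalence behind this lowering is that addition and bitwise OR coincide whenever the operands share no set bits, so no carry can ripple into the high 32-bit half. A minimal sketch in Python (the constants are made up for illustration):

```python
def no_common_bits(a: int, b: int) -> bool:
    # Mirrors the DAG-level precondition haveNoCommonBitsSet(a, b).
    return (a & b) == 0

# A 16-byte-aligned 64-bit base and an offset below 16 never share bits,
# so base + off == base | off and the low 32 bits produce no carry-out.
base = 0x1234_5678_9ABC_DEF0  # low nibble clear: 16-byte aligned
for off in range(16):
    assert no_common_bits(base, off)
    assert base + off == base | off
print("ok")
```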

This needs to be a DAGCombine (and not a selection rule) because its
main purpose is to enable subsequent DAGCombines for bitwise operations.
We don't want to turn PTRADDs into disjoint ORs whenever that's
sound, because this transform loses the information that the operation
implements pointer arithmetic, which AMDGPU for instance needs when
folding constant offsets.

For SWDEV-516125.


llvmbot commented Jun 27, 2025

@llvm/pr-subscribers-llvm-selectiondag

@llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Changes

If we can't fold a PTRADD's offset into its users, lowering it to a
disjoint OR is preferable: often, a 32-bit OR instruction suffices
where we'd otherwise need a pair of 32-bit additions with carry.

This needs to be a DAGCombine (and not a selection rule) because its
main purpose is to enable subsequent DAGCombines for bitwise operations.
We don't want to turn PTRADDs into disjoint ORs whenever that's
sound, because this transform loses the information that the operation
implements pointer arithmetic, which we will soon need in order to fold
offsets into FLAT instructions. Currently, disjoint ORs can still be
used for offset folding, so that part of the logic can't be tested yet.
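
The gating this implies can be modeled as a small predicate: convert to a disjoint OR only when the operands provably share no bits and no FLAT memory user could have folded the offset instead. A toy Python model of that decision (the dict-based user representation and names are invented for illustration; the real check consults isLegalAddressingMode):

```python
FLAT = "flat"

def can_lower_to_disjoint_or(users, have_no_common_bits):
    # No transform without the disjointness precondition.
    if not have_no_common_bits:
        return False
    # Keep the ptradd if any FLAT load/store uses it as its base pointer:
    # those are the users whose offset folding the OR would break.
    blocks = any(u["kind"] == "mem" and u["as"] == FLAT and u["base_is_ptradd"]
                 for u in users)
    return not blocks

global_store = {"kind": "mem", "as": "global", "base_is_ptradd": True}
flat_load = {"kind": "mem", "as": FLAT, "base_is_ptradd": True}

print(can_lower_to_disjoint_or([global_store], True))   # True
print(can_lower_to_disjoint_or([flat_load], True))      # False
```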

The PR contains a hacky workaround for a situation where an AssertAlign
operand of a PTRADD is not visited by the DAGCombiner before the PTRADD
itself, causing the PTRADD to be turned into a disjoint OR even though
reassociating it with the operand of the AssertAlign would be better.
This wouldn't be a problem if the DAGCombiner guaranteed that a node is
only processed after all of its operands have been processed.
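
An operands-first (post-order) traversal is exactly what would make the workaround unnecessary: visiting the inner ptradd and the AssertAlign before the outer ptradd lets both constant offsets fold into one. A minimal Python sketch (tuple-based nodes and rewrite rules invented for illustration, not the actual DAGCombiner):

```python
def fold(e):
    """Bottom-up rewriter: operands are folded before their user, like a
    DAGCombiner that guarantees post-order processing."""
    if not isinstance(e, tuple):
        return e  # leaves: pointer names (str) and constants (int)
    op = e[0]
    args = tuple(fold(a) for a in e[1:])  # visit operands first
    if op == "assertalign" and isinstance(args[0], tuple) \
            and args[0][0] == "ptradd":
        # Generic AssertAlign combine: look through to the inner ptradd.
        return args[0]
    if (op == "ptradd" and isinstance(args[1], int)
            and isinstance(args[0], tuple) and args[0][0] == "ptradd"
            and isinstance(args[0][2], int)):
        # Reassociate (ptradd (ptradd p, c1), c2) -> (ptradd p, c1 + c2),
        # which a premature ptradd -> OR rewrite of the outer node blocks.
        return ("ptradd", args[0][1], args[0][2] + args[1])
    return (op,) + args

dag = ("ptradd", ("assertalign", ("ptradd", "p", 8)), 2)
print(fold(dag))  # -> ('ptradd', 'p', 10)
```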

For SWDEV-516125.


Full diff: https://github.com/llvm/llvm-project/pull/146075.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+35)
  • (modified) llvm/test/CodeGen/AMDGPU/ptradd-sdag-optimizations.ll (+55-1)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 822bab88c8a09..71230078edc69 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -15136,6 +15136,41 @@ SDValue SITargetLowering::performPtrAddCombine(SDNode *N,
       return Folded;
   }
 
+  // Transform (ptradd a, b) -> (or disjoint a, b) if it is equivalent and if
+  // that transformation can't block an offset folding at any use of the ptradd.
+  // This should be done late, after legalization, so that it doesn't block
+  // other ptradd combines that could enable more offset folding.
+  bool HasIntermediateAssertAlign =
+      N0->getOpcode() == ISD::AssertAlign && N0->getOperand(0)->isAnyAdd();
+  // This is a hack to work around an ordering problem for DAGs like this:
+  //   (ptradd (AssertAlign (ptradd p, c1), k), c2)
+  // If the outer ptradd is handled first by the DAGCombiner, it can be
+  // transformed into a disjoint or. Then, when the generic AssertAlign combine
+  // pushes the AssertAlign through the inner ptradd, it's too late for the
+  // ptradd reassociation to trigger.
+  if (!DCI.isBeforeLegalizeOps() && !HasIntermediateAssertAlign &&
+      DAG.haveNoCommonBitsSet(N0, N1)) {
+    bool TransformCanBreakAddrMode = any_of(N->users(), [&](SDNode *User) {
+      if (auto *LoadStore = dyn_cast<MemSDNode>(User);
+          LoadStore && LoadStore->getBasePtr().getNode() == N) {
+        unsigned AS = LoadStore->getAddressSpace();
+        // Currently, we only really need ptradds to fold offsets into flat
+        // memory instructions.
+        if (AS != AMDGPUAS::FLAT_ADDRESS)
+          return false;
+        TargetLoweringBase::AddrMode AM;
+        AM.HasBaseReg = true;
+        EVT VT = LoadStore->getMemoryVT();
+        Type *AccessTy = VT.getTypeForEVT(*DAG.getContext());
+        return isLegalAddressingMode(DAG.getDataLayout(), AM, AccessTy, AS);
+      }
+      return false;
+    });
+
+    if (!TransformCanBreakAddrMode)
+      return DAG.getNode(ISD::OR, DL, VT, N0, N1, SDNodeFlags::Disjoint);
+  }
+
   if (N1.getOpcode() != ISD::ADD || !N1.hasOneUse())
     return SDValue();
 
diff --git a/llvm/test/CodeGen/AMDGPU/ptradd-sdag-optimizations.ll b/llvm/test/CodeGen/AMDGPU/ptradd-sdag-optimizations.ll
index 893deb35fe822..64e041103a563 100644
--- a/llvm/test/CodeGen/AMDGPU/ptradd-sdag-optimizations.ll
+++ b/llvm/test/CodeGen/AMDGPU/ptradd-sdag-optimizations.ll
@@ -100,7 +100,7 @@ define void @baseptr_null(i64 %offset, i8 %v) {
 
 ; Taken from implicit-kernarg-backend-usage.ll, tests the PTRADD handling in the
 ; assertalign DAG combine.
-define amdgpu_kernel void @llvm_amdgcn_queue_ptr(ptr addrspace(1) %ptr)  #0 {
+define amdgpu_kernel void @llvm_amdgcn_queue_ptr(ptr addrspace(1) %ptr) {
 ; GFX942-LABEL: llvm_amdgcn_queue_ptr:
 ; GFX942:       ; %bb.0:
 ; GFX942-NEXT:    v_mov_b32_e32 v2, 0
@@ -416,6 +416,60 @@ entry:
   ret void
 }
 
+; Check that ptradds can be lowered to disjoint ORs.
+define ptr @gep_disjoint_or(ptr %base) {
+; GFX942-LABEL: gep_disjoint_or:
+; GFX942:       ; %bb.0:
+; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX942-NEXT:    v_and_or_b32 v0, v0, -16, 4
+; GFX942-NEXT:    s_setpc_b64 s[30:31]
+  %p = call ptr @llvm.ptrmask(ptr %base, i64 s0xf0)
+  %gep = getelementptr nuw inbounds i8, ptr %p, i64 4
+  ret ptr %gep
+}
+
+; Check that AssertAlign nodes between ptradd nodes don't block offset folding,
+; taken from preload-implicit-kernargs.ll
+define amdgpu_kernel void @random_incorrect_offset(ptr addrspace(1) inreg %out) #0 {
+; GFX942_PTRADD-LABEL: random_incorrect_offset:
+; GFX942_PTRADD:       ; %bb.1:
+; GFX942_PTRADD-NEXT:    s_load_dwordx2 s[2:3], s[0:1], 0x0
+; GFX942_PTRADD-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX942_PTRADD-NEXT:    s_branch .LBB21_0
+; GFX942_PTRADD-NEXT:    .p2align 8
+; GFX942_PTRADD-NEXT:  ; %bb.2:
+; GFX942_PTRADD-NEXT:  .LBB21_0:
+; GFX942_PTRADD-NEXT:    s_load_dword s0, s[0:1], 0xa
+; GFX942_PTRADD-NEXT:    v_mov_b32_e32 v0, 0
+; GFX942_PTRADD-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX942_PTRADD-NEXT:    v_mov_b32_e32 v1, s0
+; GFX942_PTRADD-NEXT:    global_store_dword v0, v1, s[2:3]
+; GFX942_PTRADD-NEXT:    s_endpgm
+;
+; GFX942_LEGACY-LABEL: random_incorrect_offset:
+; GFX942_LEGACY:       ; %bb.1:
+; GFX942_LEGACY-NEXT:    s_load_dwordx2 s[2:3], s[0:1], 0x0
+; GFX942_LEGACY-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX942_LEGACY-NEXT:    s_branch .LBB21_0
+; GFX942_LEGACY-NEXT:    .p2align 8
+; GFX942_LEGACY-NEXT:  ; %bb.2:
+; GFX942_LEGACY-NEXT:  .LBB21_0:
+; GFX942_LEGACY-NEXT:    s_mov_b32 s4, 8
+; GFX942_LEGACY-NEXT:    s_load_dword s0, s[0:1], s4 offset:0x2
+; GFX942_LEGACY-NEXT:    v_mov_b32_e32 v0, 0
+; GFX942_LEGACY-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX942_LEGACY-NEXT:    v_mov_b32_e32 v1, s0
+; GFX942_LEGACY-NEXT:    global_store_dword v0, v1, s[2:3]
+; GFX942_LEGACY-NEXT:    s_endpgm
+  %imp_arg_ptr = call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
+  %gep = getelementptr i8, ptr addrspace(4) %imp_arg_ptr, i32 2
+  %load = load i32, ptr addrspace(4) %gep
+  store i32 %load, ptr addrspace(1) %out
+  ret void
+}
+
 declare void @llvm.memcpy.p0.p4.i64(ptr noalias nocapture writeonly, ptr addrspace(4) noalias nocapture readonly, i64, i1 immarg)
 
 !0 = !{}
+
+attributes #0 = { "amdgpu-agpr-alloc"="0" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }

@ritter-x2a ritter-x2a marked this pull request as ready for review June 27, 2025 13:37
Base automatically changed from users/ritter-x2a/06-26-_sdag_amdgpu_allow_opting_in_to_oob-generating_ptradd_transforms to main September 19, 2025 09:08

ritter-x2a commented Sep 19, 2025

Merge activity

  • Sep 19, 9:57 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Sep 19, 9:58 AM UTC: @ritter-x2a merged this pull request with Graphite.

@ritter-x2a ritter-x2a merged commit d560769 into main Sep 19, 2025
9 checks passed
@ritter-x2a ritter-x2a deleted the users/ritter-x2a/06-27-_amdgpu_sdag_dagcombine_ptradd_-_disjoint_or branch September 19, 2025 09:58

Labels

backend:AMDGPU llvm:SelectionDAG
